Plankton Recognition Challenge

Presentation Notebook

Group 11: Yazid Mouline & Guillaume Requena | AML Class 2020


In [0]:
# imports
from google.colab import drive
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from IPython.display import Image

import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns

Introduction

The goal of this challenge is to recognize very small organisms living in the oceans: plankton.

Provided with a set of images, metadata and also features extracted from the images by the Laboratoire d’Océanographie de Villefranche, we decided to treat this problem as a classification problem. A discussion about the classes and the taxonomy can be found in the Data Exploration part of this report.

To deal with this challenge, two main approaches were explored:

  • working directly on the images with convolutional neural networks
  • working with the extracted features

Our work is divided into three main parts:

  1. Data Preparation
  2. Model Selection
  3. Performance Evaluation

Data Importation

In [0]:
drive.mount('/content/drive/',force_remount=True)

1. Data Preparation

1.1. Data exploration

1.1.1. Metadata analysis

The taxonomy tree is huge. It aims to hierarchically identify everything that could possibly be a plankton. The first separation is living / not living. The living creatures are then separated into Bacteria / Eukaryota / other. Beyond this point, our biological knowledge runs out.

As stated in the challenge, our target values come from the 'level2' column of the metadata.

In [0]:
taxo = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/taxo.csv')
meta = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/meta.csv')
features_native = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/features_native.csv')
features_skimage = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/features_skimage.csv')
X_features_10p_train = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/X_features_10p_train.csv')
X_features_10p_test = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/X_features_10p_test.csv')
y_features_10p_train = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/y_features_10p_train.csv')
y_features_10p_test = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/y_features_10p_test.csv')
In [4]:
clean_meta = meta.dropna(subset=['level2'])  # delete the rows with a NaN value in level2
meta_lvl2 = clean_meta[['objid', 'level2']]
counts = meta_lvl2['level2'].value_counts()
fig = plt.figure(figsize=(10, 10))
wedges, texts = plt.pie(counts)
plt.title('level2: Pie Chart Distribution', fontsize=20)
plt.legend(wedges, counts.index)  # handles and labels both positional: mixing a positional argument with labels= raised a UserWarning
plt.show()

It is interesting to note that more than half the images are of detritus and that close to 75% of the images are either detritus or feces.

This pie chart shows that our dataset is heavily imbalanced, which makes the evaluation metric worth questioning. The micro-average f1 score reflects this class imbalance, since it is dominated by the majority classes, whereas the macro average weighs every class equally, however rare.

Still, the macro average shows how the system performs across all the classes, not just the dominant ones.
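To make the micro/macro difference concrete, here is a small illustration with toy labels (not our data): a degenerate classifier that always predicts the majority class scores high on micro-average f1 but low on macro-average f1.

```python
from sklearn.metrics import f1_score

# Toy imbalanced labels (hypothetical, not our dataset): 8 samples of class 0, 2 of class 1
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a degenerate classifier that always predicts the majority class

micro = f1_score(y_true, y_pred, average='micro')  # 0.8, equal to accuracy here
macro = f1_score(y_true, y_pred, average='macro')  # ~0.444: class 1's f1 of 0 drags the average down
```

The micro average rewards the majority-class shortcut, while the macro average exposes the total failure on the rare class.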

1.1.2. Data exploration: features exploitation approach

In addition to the images, the meta.csv file containing metadata about the images, and the taxo.csv file defining the taxonomy tree, we have two other CSV files.

  • features_native.csv.gz

  • features_skimage.csv.gz

These two files contain information extracted from the images in the form of features. They were produced by ZooProcess, a piece of software developed by engineers from the Laboratoire d’Océanographie de Villefranche-sur-Mer.

1.1.3. Data cleaning: handling missing information

In [5]:
meta.isna().sum().sort_values(ascending=False) 
Out[5]:
level1         3334
level2         1003
lineage           0
unique_name       0
depth_max         0
depth_min         0
objtime           0
objdate           0
longitude         0
latitude          0
status            0
id                0
projid            0
objid             0
dtype: int64

There are 1003 NaN values in the column level2, and 3334 in the column level1. Since 'level2' will be our target, we cannot use the images corresponding to these NaN values (it is a supervised learning problem). We deleted the corresponding rows, using the objid identifier.

Also, in the two features files, we removed the rows corresponding to the unclassified images.

In [6]:
train_na = (features_skimage.isnull().sum() / len(features_skimage)) * 100
train_na = train_na.drop(train_na[train_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Table 1: Missing Ratio' :train_na})
missing_data.head(20)
Out[6]:
Table 1: Missing Ratio
weighted_moments_normalized4 100.0
weighted_moments_normalized1 100.0
weighted_moments_normalized0 100.0
moments_normalized4 100.0
moments_normalized1 100.0
moments_normalized0 100.0
In [7]:
features_native.isnull().sum().sort_values(ascending=False)[:20]
Out[7]:
perimareaexc      34428
feretareaexc      34428
cdexc             34428
skeleton_area      6854
nb1_area           6854
symetrieh_area     6854
symetriev_area     6854
convarea_area      6854
nb2_area           6854
nb3_area           6854
slope                 0
feret                 0
skelarea              0
fractal               0
area_exc              0
histcum1              0
%area                 0
kurt                  0
skew                  0
histcum2              0
dtype: int64
  • features_skimage: six columns were entirely NaN, so we simply dropped them (moments_normalized0/1/4 and weighted_moments_normalized0/1/4).
  • features_native: ten columns contained some NaN values; we decided to simply replace each NaN by the mean of its column.
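The two cleaning rules above can be sketched as follows, on a small synthetic frame (the real CSVs live on our Drive; the values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the two features tables (hypothetical values)
df = pd.DataFrame({
    'moments_normalized0': [np.nan, np.nan, np.nan],  # entirely NaN -> dropped
    'perimareaexc':        [1.0, np.nan, 3.0],        # partially NaN -> mean-imputed
    'area':                [10.0, 20.0, 30.0],
})

df = df.dropna(axis=1, how='all')   # features_skimage rule: drop 100%-NaN columns
df = df.fillna(df.mean())           # features_native rule: replace NaN by the column mean

# df['perimareaexc'] is now [1.0, 2.0, 3.0]
```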

1.1.4. Data exploration: images exploitation approach

In [11]:
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Distrib_nb_rows.png', width=800, height=400)
Out[11]:
In [12]:
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Distrib_nb_columns.png', width=800, height=400)
Out[12]:

On the plots above there are two things to notice:

  • The median image shape is around 87x67 (median numbers of rows and columns).
  • Some images are really big compared to the others. The min and max values show this, and the standard deviations are also quite high (72 and 62).

1.2. Data pre-processing

1.2.1. Data pre-processing: features exploitation approach

The meta dataset can be useful, but let's first remove all the useless columns. It only contains information relative to the images and the projects (when the image was taken, for example). We will only keep objid and the label, i.e. level2.

Here are the steps we took in order to create an exploitable dataset:

  • Merge all three datasets: features_native, features_skimage and metadata.
  • Deal with redundancy: some features appear in both native and skimage. We removed features that are highly correlated with others (more than 95%).
  • Sample the dataset: we kept a random 10% of it. The full dataset is too large to select models quickly and effectively; this arbitrary choice let us try models without waiting hours per run.
  • Scaling using MinMaxScaler.
  • Splitting into train and test sets.
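The correlation filter, scaling, and split steps above can be sketched as follows (random data stands in for the merged table and the column names are hypothetical; the 95% threshold is the one we used):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random data in place of the merged native/skimage/meta table
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(100, 4), columns=['f1', 'f2', 'f3', 'f4'])
X['f4'] = X['f1'] * 0.99 + rng.rand(100) * 0.001  # f4 is nearly a copy of f1
y = rng.randint(0, 3, size=100)

# Redundancy: drop one column of every pair correlated above 95%
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X = X.drop(columns=to_drop)  # here only 'f4' goes

# Scale to [0, 1], then split into train and test sets
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)
```

Keeping only the upper triangle of the correlation matrix ensures that, for each redundant pair, exactly one of the two columns is dropped.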

1.2.2 Data pre-processing: images exploitation approach

As we will describe in the Model Selection part, we had two approaches using images (a 'handmade' CNN and a pre-trained model from Keras). Here is how we processed the data for each approach.

1.2.2.a. Handmade Convolutional Neural Network

The distributions of image height and width indicate that they are respectively around 80 and 60 pixels (the median values are 87 rows and 67 columns).

In order to lower the computational cost, while still staying realistic and being able to use all the images at our disposal, we decided to:

  • Resize all images to the dimension (87, 67, 1), the 1 meaning grayscale. This greatly reduces the computational cost.

Then we would build our CNN according to these sizes of inputs.

  • Data augmentation: in order to create additional data and mitigate the class imbalance problem, we applied some transformations to the images from the less represented classes: flipping left-right, flipping up-down, and rotating twice with two different angles.

After resizing and creating all these images, we store them along with their labels in a list that we only turn into an array later on, because this is gentler on memory.
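The augmentation step can be sketched as follows (a synthetic array stands in for an already-resized image, the 15-degree angles are illustrative, and we use scipy.ndimage.rotate as one way to rotate without changing the shape):

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.RandomState(0)
img = rng.rand(87, 67)        # stand-in for an image already resized to the median shape
x = img[..., np.newaxis]      # add the channel axis -> (87, 67, 1)

# Transformations applied to images from the less represented classes
augmented = [
    x,
    np.fliplr(x),                                # flip left-right
    np.flipud(x),                                # flip up-down
    rotate(x, 15, axes=(1, 0), reshape=False),   # rotation, first angle (15 deg is illustrative)
    rotate(x, -15, axes=(1, 0), reshape=False),  # rotation, second angle
]
# reshape=False keeps every augmented image at (87, 67, 1)
```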

1.2.2.b. VGG16 pre-trained model

For this pre-trained model, we had to deal with one major constraint: it takes as input images of size (224, 224, 3), as (height, width, channels).

  • Remove small images. This arbitrary choice was made to fit the constraint without distorting the images.
  • Keep a sample of only 15,000 images. Otherwise, because of the high dimension (224, 224, 3), the RAM would fill up and crash the entire execution environment.
  • Finally, augment the data as described above.

N.B. Whenever we tried to create an array that was too big, the environment would simply shut down and we would need to run everything again.

In [15]:
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Pie_chart.png', width=900, height=400)
Out[15]:

The two pie charts above represent the class distribution before and after data augmentation. We can see that, thanks to data augmentation, we gave more weight to the 'small' classes.

In [17]:
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Images.png', width=900, height=500)
Out[17]:

The images above are examples from our dataset, before and after resizing.

  • First, the human eye can easily tell whether an object is a detritus or something else. However, biological knowledge would be necessary to classify all the species, and we can imagine that even a biologist may sometimes fail to differentiate two species. This is where data scientists come in!
  • After resizing and converting to grayscale, we can still easily perceive the shapes. We can hope the same holds for our programs, i.e. that we have not lost much information.

2. Model Selection

2.1. Model Selection: features exploitation approach

To begin with, we decided to use a cross-validation strategy for this approach. Indeed, we noticed that the models tended to overfit, giving excellent results on the train set but poor results when tested.

After trying different models, we decided to stick with and optimize the k-Nearest-Neighbors algorithm, as an easy-to-understand classifier. We also thought that a Random Forest classifier might, in a way, imitate the taxonomy tree.

However, as we will describe briefly, we obtained very poor results, probably because the features extracted from the images do not fully capture the distribution and organization of the data and its classes. That is why, as you will see in the next part, we focused on Convolutional Neural Networks, which work directly on the images.

In [0]:
# Validation function
def f1_cv(model, X, y):
    # note: get_n_splits() only returns the integer 5, which discards shuffle and
    # random_state, so we pass the KFold object itself to cross_val_score
    kf = KFold(5, shuffle=True, random_state=42)
    f1 = cross_val_score(model, X, y, scoring="f1_macro", cv=kf)
    return f1

2.1.1. k-Nearest-Neighbors

In [0]:
train_results = []
neighbors = list(range(1, 20))

for n in neighbors:
  knn = KNeighborsClassifier(n_neighbors=n)
  # cross_val_score refits the model on each fold, so no separate fit/predict is needed
  train_results.append(np.mean(f1_cv(knn, X_features_10p_train, y_features_10p_train)))
  print(n)
In [0]:
fig = plt.figure(figsize=(10,6))
plt.title('Cross-validation macro f1-score wrt number of neighbors in kNN', fontsize=10)
plt.plot(neighbors, train_results)
plt.ylabel('F1 CV')
plt.xlabel('neighbors')
plt.show()

As we can see above, the kNN classifier gave middling results, and for small numbers of neighbors there is probably some overfitting taking place.

2.1.2. Random Forest Classifier

In [0]:
rfc = RandomForestClassifier(max_depth=50, min_samples_leaf=1, min_samples_split=2, n_estimators=1550)
f1_rfc = f1_cv(rfc, X_features_10p_train, y_features_10p_train)
In [23]:
pd.DataFrame([['Random Forest Classifier', np.mean(f1_rfc)]],columns=['Model','Cross-Validation f1 score'])
Out[23]:
Model Cross-Validation f1 score
0 Random Forest Classifier 0.327613

As with kNN, the Random Forest classifier did not manage to give great results, even with some hyperparameter tuning using GridSearchCV. These results are only here to show that the features approach was not conclusive for us.

We believe that using directly the images with Convolutional Neural Networks is probably the way to go here.

2.2. Model Selection: images exploitation approach

In [0]:
results = pd.read_csv(('/content/drive/My Drive/AML/Chal2 - Plankton/Data/final_results.csv')).drop('Unnamed: 0', axis=1)

We trained both of our CNN models for 10 epochs. If you want more information on the training, feel free to have a look at the scratch notebook.

2.2.1. Handmade CNN

First, based on what we saw in the data exploration, we wanted to build a CNN whose input size is fairly well adjusted to the median size of the images in our dataset.

You can see below a description of the CNN's structure.

In [0]:
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/CNNsoftmax_architechture.png', width=500, height=500)
Out[0]:

2.2.2. VGG16 pre-trained CNN | Transfer Learning

Secondly, we realized that it takes a long time to properly train a CNN and make it fit our dataset, even when the input shape matches the average image shape, and we were limited on time. So we decided to use transfer learning with a pre-trained model. A pre-trained model has already been trained for a long time on a very large dataset gathering all kinds of pictures, so its weights are already initialized, which saves time. The deeper you go through a CNN, the more complex the shapes it can identify. So to specialize the pre-trained CNN on plankton, we only retrain the last fully connected layers on our pre-processed images.
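The setup above can be sketched with Keras as follows (a sketch, not our exact notebook code: `weights=None` keeps the snippet offline where the notebook would load the ImageNet weights, and the 40-class output and 256-unit layer are illustrative choices):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

n_classes = 40  # hypothetical number of level2 classes

# Convolutional base: weights='imagenet' in the notebook; None here so the sketch runs offline
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained convolutional layers

# New fully connected head, trained on our pre-processed plankton images
h = Flatten()(base.output)
h = Dense(256, activation='relu')(h)  # 256 units is an illustrative choice
out = Dense(n_classes, activation='softmax')(h)

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

Freezing the base means only the new head's weights are updated, which is what makes transfer learning so much cheaper than training from scratch.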

You can see below a description of the pre-trained CNN's structure.

In [0]:
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/VGG16_architechture.png', width=550, height=900)
Out[0]:
In [26]:
results[['Model','Training execution time (min)', 'Number of images involved', 'f1_score']]
Out[26]:
Model Training execution time (min) Number of images involved f1_score
0 Softmax CNN 5.527883 266539 0.332946
1 VGG16 27.316829 14960 0.564963

We can see that our handmade CNN takes a fifth of the training time while using more than ten times as many images. But the obtained results are much better for the VGG16 model.

2.3. Model Selection: conclusion

First, comparing the features approach to the images approach, we notice that the features approach gives pretty bad results. Moreover, we have trouble interpreting the features, whereas with the images approach we feel more confident about what is happening and which criteria drive the classification. So we chose the images approach.

Secondly, even though the pre-trained model takes longer to train, we decided to go with the pre-trained VGG16 CNN. Given the lack of time we had to fully train a CNN from scratch, we would not have been able to obtain a good CNN without using a pre-trained one: from the beginning, the handmade CNN only gave results close to those of the features approach.

3. Performance Evaluation

Let's now have a look at the performance of the VGG16 pre-trained model.

In [31]:
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Training_validation_loss.png', width=500, height=300)
Out[31]:
In [33]:
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Best_confusion_matrix.png', width=1000, height=1000)
Out[33]:

The confusion matrix (comparing the ratios of true and predicted labels) shows that a lot of images are correctly classified, as we have high values on the diagonal. But we also have a lot of misclassified images labelled as 'detritus'. This is due to the predominance of detritus in our dataset: our CNN turns out to be very sensitive to detritus.
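The per-class ratios such a matrix uses can be reproduced with scikit-learn (toy labels here, not our plankton predictions; class 0 plays the role of the dominant 'detritus' class):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: class 0 stands in for the dominant 'detritus' class
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0]  # rare-class samples get pulled toward class 0

# normalize='true' divides each row by the number of true samples of that class,
# giving per-class ratios rather than raw counts
cm = confusion_matrix(y_true, y_pred, normalize='true')
# cm[0] -> [1.0, 0.0, 0.0]  class 0 perfectly recognized
# cm[1] -> [0.5, 0.5, 0.0]  half of class 1 predicted as class 0
# cm[2] -> [1.0, 0.0, 0.0]  all of class 2 predicted as class 0
```

Row normalization is what makes the detritus bias visible: off-diagonal mass in a rare class's row shows where its samples leak.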

In [0]:
results[results['Model']=='VGG16']
Out[0]:
Model f1_score Training execution time (min) Number of images involved
1 VGG16 0.564963 27.316829 14960

We finally obtained an f1 score of 0.565 on the test set, which is quite a good result.

Possible improvements

  • We could still improve these results with more data augmentation, since the predominance of detritus makes our CNN overly sensitive to it.

  • We could also tune the hyperparameters of the CNN, but running all the combinations to find the best one takes time.

  • We could have tried some more tricks to fit more images into our NumPy array while training the VGG16 model; indeed, we were limited to 15,000 images because of RAM issues.

  • We stuck to this pre-trained model because it gave us good results quite quickly, but with more time and computation, a CNN built from scratch and adapted to this problem could give better results.